NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).

Long-term Traffic Simulation with Interleaved Autoregressive Motion and Scenario Generation

Yang, X; Tan, S; Krähenbühl, P (August 2025, https://doi.org/10.48550/arXiv.2506.17213)

An ideal traffic simulator replicates the realistic long-term point-to-point trip that a self-driving system experiences during deployment. Prior models and benchmarks focus on closed-loop motion simulation for initial agents in a scene. This is problematic for long-term simulation. Agents enter and exit the scene as the ego vehicle enters new regions. We propose InfGen, a unified next-token prediction model that performs interleaved closed-loop motion simulation and scene generation. InfGen automatically switches between closed-loop motion simulation and scene generation mode. It enables stable long-term rollout simulation. InfGen performs at the state-of-the-art in short-term (9s) traffic simulation, and significantly outperforms all other methods in long-term (30s) simulation.
Free, publicly-accessible full text available August 5, 2026
Interactive Post-Training for Vision-Language-Action Models

Tan, S; Dou, K; Zhao, Y; Krähenbühl, P (May 2025, https://doi.org/10.48550/arXiv.2505.17016)

We introduce RIPT-VLA, a simple and scalable reinforcement-learning-based interactive post-training paradigm that fine-tunes pretrained Vision-Language-Action (VLA) models using only sparse binary success rewards. Existing VLA training pipelines rely heavily on offline expert demonstration data and supervised imitation, limiting their ability to adapt to new tasks and environments under low-data regimes. RIPT-VLA addresses this by enabling interactive post-training with a stable policy optimization algorithm based on dynamic rollout sampling and leave-one-out advantage estimation. RIPT-VLA has the following characteristics. First, it applies to various VLA models, resulting in an improvement on the lightweight QueST model by 21.2%, and the 7B OpenVLA-OFT model to an unprecedented 97.5% success rate. Second, it is computationally efficient and data-efficient: with only one demonstration, RIPT-VLA enables an unworkable SFT model (4%) to succeed with a 97% success rate within 15 iterations. Furthermore, we demonstrate that the policy learned by RIPT-VLA generalizes across different tasks and scenarios and is robust to the initial state context. These results highlight RIPT-VLA as a practical and effective paradigm for post-training VLA models through minimal supervision.
Free, publicly-accessible full text available May 22, 2026
PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding

Cho, JH; Madotto, A; Mavroudi, E; Afouras, T; Nagarajan, T; Maaz, M; Song, Y; Ma, T; Hu, S; Jain, S; et al (July 2025, https://doi.org/10.48550/arXiv.2504.13180)

Vision-language models are integral to computer vision research, yet many high-performing models remain closed-source, obscuring their data, design and training recipe. The research community has responded by using distillation from black-box models to label training data, achieving strong benchmark results, at the cost of measurable scientific progress. However, without knowing the details of the teacher model and its data sources, scientific progress remains difficult to measure. In this paper, we study building a Perception Language Model (PLM) in a fully open and reproducible framework for transparent research in image and video understanding. We analyze standard training pipelines without distillation from proprietary models and explore large-scale synthetic data to identify critical data gaps, particularly in detailed video understanding. To bridge these gaps, we release 2.8M human-labeled instances of fine-grained video question-answer pairs and spatio-temporally grounded video captions. Additionally, we introduce PLM-VideoBench, a suite for evaluating challenging video understanding tasks focusing on the ability to reason about "what", "where", "when", and "how" of a video. We make our work fully reproducible by providing data, training recipes, code & models.
Free, publicly-accessible full text available July 23, 2026
Predicting a Protein's Stability under a Million Mutations

Ouyang-Zhang, J; Diaz, D J; Klivans, A; Krähenbühl, P (October 2023, https://doi.org/10.48550/arXiv.2310.12979)

Stabilizing proteins is a foundational step in protein engineering. However, the evolutionary pressure of all extant proteins makes identifying the scarce number of mutations that will improve thermodynamic stability challenging. Deep learning has recently emerged as a powerful tool for identifying promising mutations. Existing approaches, however, are computationally expensive, as the number of model inferences scales with the number of mutations queried. Our main contribution is a simple, parallel decoding algorithm. Our Mutate Everything is capable of predicting the effect of all single and double mutations in one forward pass. It is even versatile enough to predict higher-order mutations with minimal computational overhead. We build Mutate Everything on top of ESM2 and AlphaFold, neither of which were trained to predict thermodynamic stability. We trained on the Mega-Scale cDNA proteolysis dataset and achieved state-of-the-art performance on single and higher-order mutations on S669, ProTherm, and ProteinGym datasets.
Full Text Available
Towards Long-Form Video Understanding

Wu, C; Krähenbühl, P (April 2021, IEEE Conference on Computer Vision and Pattern Recognition)
Full Text Available
Center-based 3D Object Detection and Tracking

Yin, T; Zhou, X; Krähenbühl, P (April 2021, IEEE Conference on Computer Vision and Pattern Recognition)
Full Text Available
Memory Optimization for Deep Networks

Shah, A; Wu, C; Mohan, J; Chidambaram, V; Krähenbühl, P (April 2021, International Conference on Learning Representations (ICLR))
Full Text Available
Domain Adaptation Through Task Distillation

Zhou, B; Kalra, N; Krähenbühl, P (July 2020, European Conference on Computer Vision)
Full Text Available
